HaiBiostat: Basic Tools of Multivariate Matching

Hai Nguyen

Motivation

The post here is the re-post in 2020, when I start to research in Causal Inference.
The context leaded me to the topic were:
- After the discussion with my advisor, Dr. Dulal Bhaumik, regarding the direction of my Ph.D. dissertation, we decided to work toward the CAUSAL INFERENCE.
- Matching is the very first aim in Causal Inference: “Obtaining inferring causal effects from observational data using the Matching Approach”. Thus, I have all prepared to learn and write the Matching issues.
- Parallel, I have been also analyzing the HR Exercise in Congenital Heart Disease patient vs. Normal person applying the matching problem. Maybe, revisit the matching problem in Breast Cancer nested case-control study last summer (using SAS) if I still have time during this Summer 2020.

A Small Example

Welders are exposed to chromium and nickel, substances that can cause inappropriate links between DNA and proteins, which in turn may disrupt gene expression or interfere with replication of DNA. Costa, Zhitkovich, and Toniolo [10] measured DNA-protein cross-links in samples of white blood cells from 47 subjects (all male). 01

Unmatched data for 21 railroad arc welders and 26 potential controls. Covariates are age, race (C=Caucasian, AA=African American), current smoker (Y=yes, N=no). Response is DPC=DNA-protein cross-links in percent in white blood cells. All 47 subjects are male.

Welders — Controls

ID	Age	Race	Smoker	DPC	ID	Age	Race	Smoker	DPC
1	38	C	N	1.77	1	48	AA	N	1.08
2	44	C	N	1.02	2	63	C	N	1.09
3	39	C	Y	1.44	3	44	C	Y	1.1
4	33	AA	Y	0.65	4	40	C	N	1.1
5	35	C	Y	2.08	5	50	C	N	0.93
6	39	C	Y	0.61	6	52	C	N	1.11
7	27	C	N	2.86	7	56	C	N	0.98
8	43	C	Y	4.19	8	47	C	N	2.2
9	39	C	Y	4.88	9	38	C	N	0.88
10	43	AA	N	1.08	10	34	C	N	1.55
11	41	C	Y	2.03	11	42	C	N	0.55
12	36	C	N	2.81	12	36	C	Y	1.04
13	35	C	N	0.94	13	41	C	N	1.66
14	37	C	N	1.43	14	41	AA	Y	1.49
15	39	C	Y	1.25	15	31	AA	Y	1.36
16	34	C	N	2.97	16	56	AA	Y	1.02
17	35	C	Y	1.01	17	51	AA	N	0.99
18	53	C	N	2.07	18	36	C	Y	0.65
19	38	C	Y	1.15	19	44	C	N	0.42
20	37	C	N	1.07	20	35	C	N	2.33
21	38	C	Y	1.63	21	34	C	Y	0.97
					22	39	C	Y	0.62
					23	45	C	N	1.02
					24	42	C	N	1.78
					25	30	C	N	0.95
					26	35	C	Y	1.59

Load dataset

# loadind data
gen.tox <- read_excel("data/table81.xlsx", sheet = "data")

data.tab <- gen.tox[c(1:3, 22:24),]
kable(data.tab, caption = "Generic Toxicology Dataset (6-observation example)") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

(#tab:load data)Generic Toxicology Dataset (6-observation example)
ID	Group	Age	Race	Smoker	DPC
1	Welders	38	C	N	1.77
2	Welders	44	C	N	1.02
3	Welders	39	C	Y	1.44
22	Controls	48	AA	N	1.08
23	Controls	63	C	N	1.09
24	Controls	44	C	Y	1.10

Descriptive Statistics

m.age.w <- round(mean(gen.tox$Age[gen.tox$Group=="Welders"]),0)
m.age.c <- round(mean(gen.tox$Age[gen.tox$Group=="Controls"]),0)
aa.p.w <- round(table(gen.tox$Race[gen.tox$Group=="Welders"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Welders"]))*100,0)
aa.p.c <- round(table(gen.tox$Race[gen.tox$Group=="Controls"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Controls"]))*100,0)
s.p.w <- round(table(gen.tox$Smoker[gen.tox$Group=="Welders"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Welders"]))*100,0)
s.p.c <- round(table(gen.tox$Smoker[gen.tox$Group=="Controls"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Controls"]))*100,0)

Welders — Controls

Mean Age	AA	Smoker	—	Mean Age	AA	Smoker
38	10	52	—	43	19	35

t.test(Age ~ Group, data=gen.tox)


    Welch Two Sample t-test

data:  Age by Group
t = 2.2537, df = 42.167, p-value = 0.02947
alternative hypothesis: true difference in means between group Controls and group Welders is not equal to 0
95 percent confidence interval:
 0.4662145 8.4422104
sample estimates:
mean in group Controls  mean in group Welders 
              42.69231               38.23810

tbl.age.group <- table(gen.tox$Group, gen.tox$Race)
tbl.age.group

          
           AA  C
  Controls  5 21
  Welders   2 19

chisq.test(tbl.age.group)


    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl.age.group
X-squared = 0.26754, df = 1, p-value = 0.605

fisher.test(tbl.age.group, conf.int = T, conf.level = 0.95)


    Fisher's Exact Test for Count Data

data:  tbl.age.group
p-value = 0.4364
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  0.3170452 25.9844336
sample estimates:
odds ratio 
   2.22477

The welders are about 5 years younger than the controls on average, have relatively fewer African Americans, and more smokers.
Given the t-test (t = 2.25) and a two-sided significance level of 0.03 showed the difference in age between welders and controls is
The 2 binary variables race & smoking are neither significant by that standard nor by Fisher’s exact test for a 2×2 table.

Notation

Data genetic toxicity had \(L\) = 47 subjects, \(l = 1,2, ..., L\).

For subject \(l\), the observed covariate, \(\textbf{x}_l\), is 3-dimensional, \(\textbf{x}_l = (x_{l1}, x_{l2}, x_{l3})^T\) , where

\(x_{l1}\) is the age
\(x_{l2}\) encodes race, \(x_{l2}\) = 1 if \(l\) is African American, \(x_{l2}\) =0 if \(l\) is Caucasian,
\(x_{l3}\) encodes smoking, \(x_{l3}\) = 1 if \(l\) is is a current smoker, \(x_{l3}\) = 0 otherwise.

\(\rightarrow\) For instance, \(x_1\) = \((38,0,0)^T\) .

The variable \(Z_l\) distinguishes treated subjects from potential controls: \(Z_l\) = 1 for a treated subject, here a welder; \(Z_l\) = 0 for a potential control.

gen.tox$Race <- ifelse(gen.tox$Race=="C", 0, 1)
gen.tox$Smoker <- ifelse(gen.tox$Smoker=="N", 0, 1)

Propensity Score

The PS is the conditional probability of exposure to treatment given the observed covariates, \(e (\textbf{X}) = Pr(\textbf{Z} = 1| \textbf{x})\).
The PS is defined in terms of the observed covariates, x, even though there is invariably concern about other covariates that were not measured.
Properties of the PS:
- balance property
- the ‘ideal match’ could be produced simply by matching for the observed covariate x. Recall also that it may be difficult to match closely for every one of the many covariates in \(x\), but it is easy to match on one variable, the propensity score, \(e(\mathbf{x})\), and doing that balances all of \(x\).

Brief examination of Genetic Toxicity data suggested:

at least age, \(x_1\), and possibly race and smoking, \(x_2\) and \(x_3\), could be used to predict treatment assignment \(Z\), so that the PS is not constant.
Saying that welders tend to be somewhat younger than controls is much the same as saying that the chance of being a welder is lower for someone who is older, i.e. \(e(\textbf{X})\) is lower when \(x_1\) is higher.
If 2 subjects have the same PS \(e(\textbf{X})\), they may have different values of \(X\) \(\rightarrow\) subjects were matched for \(e(\textbf{X})\), they may be mismatched for x \(\rightarrow\) the mismatches in x will be due to chance and will tend to balance, particularly in large samples. ( e.g. if young nonsmokers and old smokers have the same PS, then a match on the PS may pair a young nonsmoking welder to an old smoking control, but it will do this about as often as it pairs an old smoking welder to a young nonsmoking control) \(\rightarrow\) matching on \(e(\textbf(X))\) tends to balance \(\textbf{X}\) \(\rightarrow\) \(Z\)| \(e(\textbf{X})\) indep. \(X\)

In short,

matching on \(e(\textbf{X})\) is often practical even when there are many covariates in x because \(e(\textbf{X})\) is a single variable,
failure to balance \(e(\textbf{X})\) implies that \(X\) is not balanced
matching on \(e(\textbf{X})\) tends to balance all of \(X\). On the negative side, success in balancing the observed covariates, \(X\), provides no assurance that unmeasured covariates are balanced

\(\rightarrow\) For example, the men whose his father worked as a welder is associated with a much higher chance that the son also will work as a welder, so father’s occupation is an unmeasured covariate that could be used to predict treatment \(Z\); then success in matching on the propensity score \(e(\textbf{X})\) would tend to balance age, race and smoking, but there is no reason to expect it to balance father’s occupation.

Conduct `PS`

The PS is unknown, but it can be estimated from the data at hand. PS is estimated by a linear logit model \[ log {\frac{e(\textbf{X}_l)}{1 - e(\textbf{X}_l)}} = \zeta_0 + \zeta_1 x_{l1} + \zeta_2 x_{l2} + \zeta_3 x_{l3}\]

and the fitted values \(\hat{e}(\textbf{X}_l)\) from this model are the estimates of the PS.

#fit a propensity score model. logistic regression
gen.tox$Group <- relevel(as.factor(gen.tox$Group), ref="Controls")
psmodel <- glm(Group ~ Age + Race + Smoker, family=binomial(), data=gen.tox)

#show coefficients etc
summary(psmodel)


Call:
glm(formula = Group ~ Age + Race + Smoker, family = binomial(), 
    data = gen.tox)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4968  -1.0216  -0.5245   1.0786   1.8186  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.06543    2.14200   1.431    0.152
Age         -0.08503    0.05178  -1.642    0.101
Race        -0.76900    0.98252  -0.783    0.434
Smoker       0.55095    0.65212   0.845    0.398

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 64.623  on 46  degrees of freedom
Residual deviance: 58.706  on 43  degrees of freedom
AIC: 66.706

Number of Fisher Scoring iterations: 3

#create propensity score
pscore<-psmodel$fitted.values
gen.tox$ps <- round(pscore,2)

# mean of PS in Welders group
mean.ps.welder <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)

# mean of PS in 21 largest Controls group
mean.ps.21largestcontrols <- gen.tox %>%
  filter(Group=="Controls") %>%
  top_n(21, ps) %>% # wrapper that uses filter() and min_rank() to select the top or bottom entries in each group, ordered by ps.
  arrange(desc(ps)) %>% # order data follow ps decreasing
  summarize(round(mean(ps),2)) %>%
  as.numeric()

Estimated PS for 21 railroad arc welders and 26 potential controls. Covariates are age, race (C=Caucasian, AA=African American), current smoker (Y=yes, N=no).

ID	Age	Race	Smoker	DPC	\(\hat{e}(\textbf{X})\)	ID	Age	Race	Smoker	DPC	\(\hat{e}(\textbf{X})\)
1	38	C	N	1.77	0.46	1	48	AA	N	1.08	0.14
2	44	C	N	1.02	0.34	2	63	C	N	1.09	0.09
3	39	C	Y	1.44	0.57	3	44	C	Y	1.1	0.47
4	33	AA	Y	0.65	0.51	4	40	C	N	1.1	0.42
5	35	C	Y	2.08	0.65	5	50	C	N	0.93	0.23
6	39	C	Y	0.61	0.57	6	52	C	N	1.11	0.2
7	27	C	N	2.86	0.68	7	56	C	N	0.98	0.15
8	43	C	Y	4.19	0.49	8	47	C	N	2.2	0.28
9	39	C	Y	4.88	0.57	9	38	C	N	0.88	0.46
10	43	AA	N	1.08	0.2	10	34	C	N	1.55	0.46
11	41	C	Y	2.03	0.53	11	42	C	N	0.55	0.54
12	36	C	N	2.81	0.5	12	36	C	Y	1.04	0.38
13	35	C	N	0.94	0.52	13	41	C	N	1.66	0.64
14	37	C	N	1.43	0.48	14	41	AA	Y	1.49	0.4
15	39	C	Y	1.25	0.57	15	31	AA	Y	1.36	0.35
16	34	C	N	2.97	0.54	16	56	AA	Y	1.02	0.55
17	35	C	Y	1.01	0.65	17	51	AA	N	0.99	0.13
18	53	C	N	2.07	0.19	18	36	C	Y	0.65	0.12
19	38	C	Y	1.15	0.6	19	44	C	N	0.42	0.64
20	37	C	N	1.07	0.48	20	35	C	N	2.33	0.34
21	38	C	Y	1.63	0.6	21	34	C	Y	0.97	0.52
						22	39	C	Y	0.62
						23	45	C	N	1.02
						24	42	C	N	1.78
						25	30	C	N	0.95
						26	35	C	Y	1.59

ps.w <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)
ps.c <- round(mean(gen.tox$ps[gen.tox$Group=="Controls"]),2)

Welders — Controls

Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)	—	Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)
38	10	52	0.51	—	43	19	35	0.4

Welders #10 and #18 have similar \(\hat{e}(\textbf{x})\) but different patterns of covariates \(X\).
The limitations of pair matching, apply with any variable, including the PS. The mean of the 21 largest \(\hat{e}(\textbf{x})\)’s in the control group is 0.46, somewhat less than the mean of \(\hat{e}(\textbf{x})\)) in the treated group, namely 0.51, so no pair matching can completely close that gap.

Distance Matrices

Squared difference in the \(\hat{e}(\textbf{x})\) distance

A distance matrix is a table with one row for each treated subject and one column for each potential control. The value in row \(i\) and column \(j\) of the table is the distance between the \(i^{th}\) treated subject and the \(j^{th}\) potential control. 2 individuals with the same value \(\mathbf{x}\) would have distance zero.
The welder data: 21 rows x 26 columns (21x26) distance. The distance is the squared difference in the \(\hat{e}(\textbf{x})\)

ps.vec.c <- gen.tox$ps[gen.tox$Group=="Controls"]
ps.vec.w <- gen.tox$ps[gen.tox$Group=="Welders"]
d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    d.m[i,j] <- round((ps.vec.w[i] - ps.vec.c[j])^2, 2)
  }
}

Table. Squared differences in PS between welders and controls: \((\hat{e}(\textbf{X})_{Welder} - \hat{e}(\textbf{X})_{Control})^2\)

Rows are the 21 welders and columns are for the first 6 of 26 potential controls.

Welder <- c(1:21)
d.m.d <- as.data.frame(d.m)
names(d.m.d) <- paste0("Control ", 1:26)
d.m.d.n <- as.data.frame(cbind(Welder, d.m.d))
kable(d.m.d.n[,1:7], caption = "Squared differences in `PS` between welders and controls")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

(#tab:table of sq difference PS)Squared differences in `PS` between welders and controls
Welder	Control 1	Control 2	Control 3	Control 4	Control 5	Control 6
1	0.10	0.14	0.00	0.00	0.05	0.07
2	0.04	0.06	0.02	0.01	0.01	0.02
3	0.18	0.23	0.01	0.02	0.12	0.14
4	0.14	0.18	0.00	0.01	0.08	0.10
5	0.26	0.31	0.03	0.05	0.18	0.20
6	0.18	0.23	0.01	0.02	0.12	0.14
7	0.29	0.35	0.04	0.07	0.20	0.23
8	0.12	0.16	0.00	0.00	0.07	0.08
9	0.18	0.23	0.01	0.02	0.12	0.14
10	0.00	0.01	0.07	0.05	0.00	0.00
11	0.15	0.19	0.00	0.01	0.09	0.11
12	0.13	0.17	0.00	0.01	0.07	0.09
13	0.14	0.18	0.00	0.01	0.08	0.10
14	0.12	0.15	0.00	0.00	0.06	0.08
15	0.18	0.23	0.01	0.02	0.12	0.14
16	0.16	0.20	0.00	0.01	0.10	0.12
17	0.26	0.31	0.03	0.05	0.18	0.20
18	0.00	0.01	0.08	0.05	0.00	0.00
19	0.21	0.26	0.02	0.03	0.14	0.16
20	0.12	0.15	0.00	0.00	0.06	0.08
21	0.21	0.26	0.02	0.03	0.14	0.16

Look at the disadvantage:
- that 2 controls with the same \(\hat{e}(\mathbf{x})\) may have different patterns of covariates, \(\mathbf{x}\), and this is ignored
\(\Rightarrow\) In the \(1^{st}\) row and \(3^{rd}\) and \(4^{th}\) columns of Table above, the distance is 0 to two decimal places. Actually, the distances between welder #1 and potential controls #3 and #4 are, respectively, \((0.46−0.47)^2 = 0.0001\) and \((0.46−0.42)^2 = 0.0016,\) so control #3 is ever so slightly closer to welder #1.
- However, in terms of the details of \(\textbf{x}\), control #4 looks to be the better match, a nonsmoker with a two-year difference in age, as opposed to control #3, a smoker with a six-year difference in age.
\(\rightarrow\) Because younger smokers are more common in the welder group, the PS is indifferent between a younger nonsmoker and an older smoker, but the details of \(\textbf{x}\) suggest that control #4 is a better match for welder #1 than is control #3.

Justifying by adding caliper on the `PS`

An alternative distance (Paul R. Rosenbaum and Rubin 1985):

individuals be close on the PS, \(\hat{e}(\textbf{x})\), but once this is achieved, the details of \(\textbf{x}\) affect the distance.
with a caliper of width \(w\), if two individuals, say \(k\) and \(l\),
- have PS that differ by more than \(w\) i.e. \(| \hat{e}(\textbf{x}_k) - \hat{e}(\textbf{x}_l) | > w\) \(\rightarrow\) set \(w\) = \(\infty\)
- if \(| \hat{e}(\textbf{x}_k) - \hat{e}(\textbf{x}_l) | \le w\) \(\rightarrow\) the distance is a measure of proximity of \(\textbf{x}_k\) and \(\textbf{x}_l\).

sd.ps <- round(sd(pscore),3) # [1] 0.172
w <- round(sd.ps/2,3)
paste0("The standard deviation of ps: ", sd.ps)

[1] "The standard deviation of ps: 0.172"

paste0("The caliper w (half of sd of ps): ", sd.ps, ", used to demonstate in this post.")

[1] "The caliper w (half of sd of ps): 0.172, used to demonstate in this post."

The caliper width, \(w\), is often taken as a multiple of the standard deviation of the PS, so that by varying the multiplier, one can vary the relative importance given to \(\hat{e}(\textbf{x})\) and \(\textbf{x}\).
- The standard deviation of \(\hat{e}(\textbf{X})\) is 0.172.
- We illustrate 2 distance matrices using a caliper on the PS, in which the caliper is half of the standard deviation of the PS, or 0.172/2 = 0.086.

In problems of practical size, a caliper of 20% of the sd of ps is more common, and even that may be too large. A reasonable strategy: to start with a caliper of 20% of the sd of ps \(\rightarrow\) adjusting the caliper if needed to obtain balance on the propensity score.

Mahalanobis distance matrix

If \(\hat{\Sigma}\) is the sample covariance matrix of \(\textbf{x}\), then the estimated Mahalanobis distance between \(\textbf{x}_k\) and \(\textbf{x}_l\) is \[ (\textbf{x}_k - \textbf{x}_l)^T \hat{\Sigma}^{-1} (\textbf{x}_k - \textbf{x}_l) \]

var.m <-gen.tox %>%
  dplyr::select(Age, Race, Smoker) %>%
  unname() %>%
  as.matrix()

Sample var-cov matrix

# sample variance covariance matrix
cov.m <- cov(var.m)
round(cov.m,2)

      [,1] [,2]  [,3]
[1,] 54.04 0.39 -0.94
[2,]  0.39 0.13  0.02
[3,] -0.94 0.02  0.25

Inverse of sample var-cov matrix

# inverse of sample variance covariance matrix
cov.m.inv <- inv(cov.m)
round(cov.m.inv,3)

       [,1]   [,2]   [,3]
[1,]  0.021 -0.077  0.084
[2,] -0.077  8.127 -1.009
[3,]  0.084 -1.009  4.407

Mahalanobis `dmat` Rubin (1980)

Distance matrix would have 21 treated subject (Welders) and 26 controls; the value in row \(i\) and column \(j\) of the table is the distance between the \(i^{th}\) treated subject and the \(j^{th}\) potential control.

var.m.w <-gen.tox %>%
  filter(Group=="Welders") %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()

var.m.c <-gen.tox %>%
  filter(Group=="Controls") %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()

m.d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.d.m[i,j] <- t(var.m.w[i,] - var.m.c[j,]) %*% cov.m.inv %*% (var.m.w[i,] - var.m.c[j,])
  }
}

# Mahalanobis distance
Welder <- c(1:21)
d.m.d.m <- as.data.frame(m.d.m)
names(d.m.d.m) <- paste0("Control ", 1:26)
d.d.m.d.n <- cbind(Welder, d.m.d.m)
kable(round(d.d.m.d.n[,1:7],2), caption = "Mahalanobis distance between 21 Welders vs 26 Controls dealed with Age, Race and Smoker")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 1: Mahalanobis distance between 21 Welders vs 26 Controls dealed with Age, Race and Smoker
Welder	Control 1	Control 2	Control 3	Control 4	Control 5	Control 6
1	8.65	12.82	6.15	0.08	2.95	4.02
2	7.84	7.40	4.41	0.33	0.74	1.31
3	13.33	12.21	0.51	4.26	5.05	5.70
4	6.51	28.55	12.29	11.42	16.20	17.65
5	13.85	15.80	1.66	4.08	6.51	7.49
6	13.33	12.21	0.51	4.26	5.05	5.70
7	13.95	26.58	13.18	3.47	10.85	12.82
8	13.46	9.27	0.02	5.09	4.24	4.56
9	13.33	12.21	0.51	4.26	5.05	5.70
10	0.51	19.40	14.89	7.85	10.20	11.17
11	13.31	10.65	0.18	4.59	4.56	5.05
12	9.24	14.95	7.06	0.33	4.02	5.25
13	9.60	16.08	7.57	0.51	4.61	5.93
14	8.92	13.87	6.58	0.18	3.47	4.61
15	13.33	12.21	0.51	4.26	5.05	5.70
16	10.00	17.25	8.13	0.74	5.25	6.65
17	13.85	15.80	1.66	4.08	6.51	7.49
18	9.41	2.05	4.56	3.47	0.18	0.02
19	13.40	13.04	0.74	4.15	5.35	6.08
20	8.92	13.87	6.58	0.18	3.47	4.61
21	13.40	13.04	0.74	4.15	5.35	6.08

Another way to produce the Mahalanobis distance matrix: look at the source code of mahal function.

source("mahal.R")
print(mahal)

function (z, X) 
{
    X <- as.matrix(X)
    n <- dim(X)[1]
    rownames(X) <- 1:n
    k <- dim(X)[2]
    m <- sum(z)
    cv <- cov(X)
    out <- matrix(NA, m, n - m)
    Xt <- X[z == 1, ]
    Xc <- X[z == 0, ]
    rownames(out) <- rownames(X)[z == 1]
    colnames(out) <- rownames(X)[z == 0]
    require(MASS)
    icov <- ginv(cv)
    for (i in 1:m) {
        out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
    }
    out
}

Mahalanobis dmat within `PS` calipers.

Rows are the 21 welders and columns are for the first 6 of 26 potential controls. An \(\infty\) signifies that the caliper is violated.

m.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.m[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m[i,j])
  }
}
Welder <- c(1:21)
d.m.m <- as.data.frame(m.m)
names(d.m.m) <- paste0("Control ", 1:26)
d.d.m.m <- cbind(Welder, d.m.m)
kable(round(d.d.m.m[,1:7],2), caption = "Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

(#tab:mahal w PS caliper)Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.
Welder	Control 1	Control 2	Control 3	Control 4	Control 5	Control 6
1	Inf	Inf	6.15	0.08	Inf	Inf
2	Inf	Inf	Inf	0.33	Inf	Inf
3	Inf	Inf	Inf	Inf	Inf	Inf
4	Inf	Inf	12.29	Inf	Inf	Inf
5	Inf	Inf	Inf	Inf	Inf	Inf
6	Inf	Inf	Inf	Inf	Inf	Inf
7	Inf	Inf	Inf	Inf	Inf	Inf
8	Inf	Inf	0.02	5.09	Inf	Inf
9	Inf	Inf	Inf	Inf	Inf	Inf
10	0.51	Inf	Inf	Inf	10.20	11.17
11	Inf	Inf	0.18	Inf	Inf	Inf
12	Inf	Inf	7.06	0.33	Inf	Inf
13	Inf	Inf	7.57	Inf	Inf	Inf
14	Inf	Inf	6.58	0.18	Inf	Inf
15	Inf	Inf	Inf	Inf	Inf	Inf
16	Inf	Inf	8.13	Inf	Inf	Inf
17	Inf	Inf	Inf	Inf	Inf	Inf
18	9.41	Inf	Inf	Inf	0.18	0.02
19	Inf	Inf	Inf	Inf	Inf	Inf
20	Inf	Inf	6.58	0.18	Inf	Inf
21	Inf	Inf	Inf	Inf	Inf	Inf

Cons of Mahalanobis distance

Mahalanobis distance was originally developed for use with multivariate Normal data
\(\rightarrow\) With data that are not Normal, Mahalanobis distance can exhibit some rather odd behavior.
If one covariate
- contains extreme outliers or
- has a long-tailed distribution,

\(\rightarrow\) its standard deviation will be inflated, and \(\rightarrow\) Mahalanobis distance will tend to ignore that covariate in matching.

With binary indicators,
- the variance is largest for events that occur about half the time, and
- it is smallest for events with probabilities near 0 and 1

\(\rightarrow\) Mahalanobis distance gives greater weight to binary variables with probabilities near 0 or 1 than to binary variables with probabilities closer to 1/2.

In welder data,
- 2 individuals with the same Age and Race but different smoking behavior have a Mahalanobis distance of 4.41,
- whereas 2 individuals with the same Age and Smoking behavior but different Race have a Mahalanobis distance of 8.06,
  \(\rightarrow\) so a mismatch on Race is counted as almost twice as bad as a mismatch on Smoking.
- 2 people who differed by 20 years in Age with the same Race and Smoking would have a Mahalanobis distance of 0.021 × \(20^2\) = 8.204232,
  \(\rightarrow\) a difference in Race counts about as much as a 20-year difference in Age.

\(\Longrightarrow\) In many contexts, rare binary covariates are not of overriding importance, and outliers do not make a covariate unimportant \(\rightarrow\) Mahalanobis distance may not be appropriate with covariates of this kind.

Solution `rank-based Mahalanobis distance`

replaces each of the covariates, one at a time, by its ranks, with average ranks for ties,
\(\Longrightarrow\) limits the influence of outliers.
premultiplies and postmultiplies the covariance matrix of the ranks by a diagonal matrix whose diagonal elements are the ratios of the standard deviation of untied ranks, 1, …, L, to the standard deviations of the tied ranks of the covariates,
\(\rightarrow\) the adjusted covariance matrix has a constant diagonal
\(\Longrightarrow\) prevents heavily tied covariates, such as rare binary variables, from having increased influence due to reduced variance.
computes the Mahalanobis distance using the ranks and this adjusted covariance matrix.

### Note that the covariance matrix now is on the rank-based matrix
data.r <- gen.tox %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()
for (j in 1:dim(data.r)[2]){data.r[,j] <- rank(data.r[,j])}
cov.m.r <- cov(data.r)

## Step 1
var.m.w.r <- data.r[gen.tox$Group=="Welders",]
var.m.c.r <- data.r[gen.tox$Group=="Controls",]

## Step 2:
var.untied <- var(1:dim(gen.tox)[1])
rat <- sqrt(var.untied/diag(cov.m.r))

cov.m.r.adjusted <- diag(rat) %*% cov.m.r %*% diag(rat)
cov.m.r.adjusted.inv <- inv(cov.m.r.adjusted)

m.d.m.r <- matrix(NA, nrow = 21, ncol = 26)

for (i in 1:21){
  for (j in 1:26){
    m.d.m.r[i,j] <- t(var.m.w.r[i,] - var.m.c.r[j,]) %*% cov.m.r.adjusted.inv %*% (var.m.w.r[i,] - var.m.c.r[j,])
  }
}

# rank-based Mahalanobis distance
Welder <- c(1:21)
d.m.d.m.r <- as.data.frame(m.d.m.r)
names(d.m.d.m.r) <- paste0("Control ", 1:26)
d.d.m.d.m.r <- cbind(Welder, d.m.d.m.r)
kable(round(d.d.m.d.m.r[,1:7],2), caption = "Rank-based Mahalanobis distance between 21 Welders vs 26 Controls dealed with Age, Race and Smoker")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

(#tab:rank-based Mahalanobis distance)Rank-based Mahalanobis distance between 21 Welders vs 26 Controls dealed with Age, Race and Smoker
Welder	Control 1	Control 2	Control 3	Control 4	Control 5	Control 6
1	4.61	4.40	5.98	0.33	2.69	3.21
2	2.98	0.70	3.21	0.47	0.15	0.28
3	6.91	4.62	0.84	3.04	3.66	3.93
4	8.14	14.80	10.43	7.69	12.17	13.00
5	9.03	8.49	3.93	3.66	6.55	7.15
6	6.91	4.62	0.84	3.04	3.66	3.93
7	10.20	12.31	12.86	3.93	9.30	10.26
8	6.77	3.29	0.04	3.92	2.99	3.05
9	6.91	4.62	0.84	3.04	3.66	3.93
10	0.25	4.72	7.61	3.03	3.72	4.01
11	6.71	3.79	0.28	3.38	3.18	3.34
12	5.86	6.33	7.61	0.98	4.24	4.89
13	6.98	7.96	9.02	1.68	5.59	6.33
14	5.25	5.41	6.83	0.64	3.49	4.08
15	6.91	4.62	0.84	3.04	3.66	3.93
16	8.30	9.78	10.61	2.56	7.12	7.96
17	9.03	8.49	3.93	3.66	6.55	7.15
18	3.33	0.05	3.00	1.68	0.05	0.01
19	7.34	5.62	1.58	2.99	4.34	4.72
20	5.25	5.41	6.83	0.64	3.49	4.08
21	7.34	5.62	1.58	2.99	4.34	4.72

Another way to produce the rank-based Mahalanobis distance matrix: look at the source code of smahal function.

source("smahal.R")
print(smahal)

function (z, X) 
{
    X <- as.matrix(X)
    n <- dim(X)[1]
    rownames(X) <- 1:n
    k <- dim(X)[2]
    m <- sum(z)
    for (j in 1:k) {
        X[, j] <- rank(X[, j])
    }
    cv <- cov(X)
    vuntied <- var(1:n)
    rat <- sqrt(vuntied/diag(cv))
    cv <- diag(rat) %*% cv %*% diag(rat)
    out <- matrix(NA, m, n - m)
    Xc <- X[z == 0, ]
    Xt <- X[z == 1, ]
    rownames(out) <- rownames(X)[z == 1]
    colnames(out) <- rownames(X)[z == 0]
    library(MASS)
    icov <- ginv(cv)
    for (i in 1:m) {
        out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
    }
    out
}

In welder data,
- 2 individuals (Welder 2 vs Control 3) with the same Age and Race but different smoking behavior have a rank-based Mahalanobis distance of 3.21 (4.41 for a Mahalanobis distance),
- whereas 2 individuals (Welder 11 vs Control 14) with the same Age and Smoking behavior but different Race have a rank-based Mahalanobis distance of 3.07 (8.06 for a Mahalanobis distance),

\(\Longrightarrow\) a mismatch on race is counted as about equal to a mismatch on smoking.

Rank-based Mahalanobis distances within Propensity Score calipers. An Inf signifies that the caliper is violated

m.m.r <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.m.r[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m.r[i,j])
  }
}
Welder <- c(1:21)
d.m.m.r <- as.data.frame(m.m.r)
names(d.m.m.r) <- paste0("Control ", 1:26)
d.d.m.m.r <- cbind(Welder, d.m.m.r)
kable(round(d.d.m.m.r[,1:7],2), caption = "rank-based Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

(#tab:r-mahal w PS caliper)rank-based Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.
Welder	Control 1	Control 2	Control 3	Control 4	Control 5	Control 6
1	Inf	Inf	5.98	0.33	Inf	Inf
2	Inf	Inf	Inf	0.47	Inf	Inf
3	Inf	Inf	Inf	Inf	Inf	Inf
4	Inf	Inf	10.43	Inf	Inf	Inf
5	Inf	Inf	Inf	Inf	Inf	Inf
6	Inf	Inf	Inf	Inf	Inf	Inf
7	Inf	Inf	Inf	Inf	Inf	Inf
8	Inf	Inf	0.04	3.92	Inf	Inf
9	Inf	Inf	Inf	Inf	Inf	Inf
10	0.25	Inf	Inf	Inf	3.72	4.01
11	Inf	Inf	0.28	Inf	Inf	Inf
12	Inf	Inf	7.61	0.98	Inf	Inf
13	Inf	Inf	9.02	Inf	Inf	Inf
14	Inf	Inf	6.83	0.64	Inf	Inf
15	Inf	Inf	Inf	Inf	Inf	Inf
16	Inf	Inf	10.61	Inf	Inf	Inf
17	Inf	Inf	Inf	Inf	Inf	Inf
18	3.33	Inf	Inf	Inf	0.05	0.01
19	Inf	Inf	Inf	Inf	Inf	Inf
20	Inf	Inf	6.83	0.64	Inf	Inf
21	Inf	Inf	Inf	Inf	Inf	Inf

Optimal Pair Matching

An optimal pair matching pairs each treated subject with a different control to minimize the total distance within matched pairs.
- In the welder data: forming 21 pairs using 21 different controls from the 26 potential controls so that the sum of the 21 distances within pairs is minimized.
- The problem is serious: the closest control to one treated subject may also be the closest control to another treated subject.

Hansen’s pairmatch function in his optmatch package makes Bertsekas’ Fortran code (aka ‘assignment problem’ solved by Kuhn in 1955) available from inside the statistical package R;

The SAS program proc assign also solves the assignment problem.

Match pair on `PS`

I generated code in which the treated’s ps match with the closest control’s ps. It seemed the marginal means were well balanced.

ctrl <- NULL
k <- NULL

d <- d.m.d.n[-1]

for (i in 1:21){
  k <- which.min(d[i,])
  ctrl[i] <- names(k)
  d <- d[-k]
}

index <- base::order(ctrl)

Welders <- gen.tox %>%
  filter(Group=="Welders") %>%
  dplyr::select(Age, Race, Smoker, ps)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps")]

Controls <- gen.tox %>%
  filter(Group=="Controls") %>%
  dplyr::select(Age, Race, Smoker, ps)

Controls$id <- paste0("Control ",1:26)

matched.controls <- Controls[which(Controls$id %in% ctrl),]

# Reorder follow the ctrl match order
matched.controls$id <- reorder.factor(matched.controls$id, new.order=ctrl)
#matched.controls <- matched.controls[-5]

matched.controls <- matched.controls %>% dplyr::arrange(id)

kable(cbind(Welders,matched.controls), caption="Pair match using the closed measure in the PS")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

(#tab:pairmatch using closest ps)Pair match using the closed measure in the PS
Pair	Age	Race	Smoker	ps	Age	Race	Smoker	ps	id
Pair 1	38	0	0	0.46	44	0	1	0.47	Control 3
Pair 2	44	0	0	0.34	47	0	0	0.28	Control 8
Pair 3	39	0	1	0.57	34	0	0	0.54	Control 10
Pair 4	33	1	1	0.51	38	0	0	0.46	Control 9
Pair 5	35	0	1	0.65	36	0	1	0.64	Control 12
Pair 6	39	0	1	0.57	31	1	1	0.55	Control 15
Pair 7	27	0	0	0.68	36	0	1	0.64	Control 18
Pair 8	43	0	1	0.49	40	0	0	0.42	Control 4
Pair 9	39	0	1	0.57	35	0	0	0.52	Control 20
Pair 10	43	1	0	0.20	48	1	0	0.14	Control 1
Pair 11	41	0	1	0.53	39	0	1	0.57	Control 22
Pair 12	36	0	0	0.50	42	0	0	0.38	Control 11
Pair 13	35	0	0	0.52	41	0	0	0.40	Control 13
Pair 14	37	0	0	0.48	42	0	0	0.38	Control 24
Pair 15	39	0	1	0.57	30	0	0	0.63	Control 25
Pair 16	34	0	0	0.54	35	0	1	0.65	Control 26
Pair 17	35	0	1	0.65	34	0	1	0.67	Control 21
Pair 18	53	0	0	0.19	50	0	0	0.23	Control 5
Pair 19	38	0	1	0.60	41	1	1	0.35	Control 14
Pair 20	37	0	0	0.48	44	0	0	0.34	Control 19
Pair 21	38	0	1	0.60	45	0	0	0.32	Control 23

m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Welders — Matched Controls

Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)	—	Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)
38	10	52	0.51	—	40	14	38	0.46

\(\Longrightarrow\) the marginal means the matching above are well balanced as expected

Next, I use the pairmatch function in optmatch package to compare.

gen.tox.op <- gen.tox
gen.tox.op$Group1 <- as.numeric(ifelse(gen.tox.op$Group=="Welders", 1,0))
#pairmatch(match_on( Group~ps, data=gen.tox ) )# did not run
pm <- pairmatch(Group1 ~ ps, data = gen.tox.op)
print(pairmatch(Group1 ~ ps, data = gen.tox.op), grouped = TRUE) # knowing the ordered id of treated and controls

 Group Members
   1.1   1, 35
  1.10  18, 27
  1.11  19, 46
  1.12   2, 29
  1.13  20, 34
  1.14  21, 39
  1.15   3, 41
  1.16   4, 30
  1.17   5, 47
  1.18   6, 43
  1.19   7, 42
   1.2  10, 26
  1.20   8, 45
  1.21   9, 31
   1.3  11, 25
   1.4  12, 40
   1.5  13, 44
   1.6  14, 24
   1.7  15, 36
   1.8  16, 32
   1.9  17, 33

summary(pairmatch(Group1 ~ ps, data = gen.tox.op))

Structure of matched sets:
1:1 0:1 
 21   5 
Effective Sample Size:  21 
(equivalent number of matched pairs).

#all.equal(names(pm), row.names(gen.tox))

## Housekeeping
ipm <- as.integer(pm)
gen.tox.op <- cbind(gen.tox.op, matches=pm, ipm)
data.pairmatch<-gen.tox.op[matched(pm),] # only select matched cases

Welders <- data.pairmatch %>%
  filter(Group=="Welders") %>%
  arrange(ipm) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm")]

matched.controls <- data.pairmatch %>%
  filter(Group=="Controls") %>%
  arrange(ipm) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm)

kable(cbind(Welders,matched.controls), caption="Optimal pair match using the squared difference in the PS")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 2: Optimal pair match using the squared difference in the PS
	Pair	Age	Race	Smoker	ps	ipm	Age	Race	Smoker	ps	ipm
1	Pair 1	38	0	0	0.46	1	41	1	1	0.35	1
18	Pair 2	53	0	0	0.19	2	52	0	0	0.20	2
19	Pair 3	38	0	1	0.60	3	30	0	0	0.63	3
2	Pair 4	44	0	0	0.34	4	47	0	0	0.28	4
20	Pair 5	37	0	0	0.48	5	41	0	0	0.40	5
21	Pair 6	38	0	1	0.60	6	36	0	1	0.64	6
3	Pair 7	39	0	1	0.57	7	35	0	0	0.52	7
4	Pair 8	33	1	1	0.51	8	38	0	0	0.46	8
5	Pair 9	35	0	1	0.65	9	35	0	1	0.65	9
6	Pair 10	39	0	1	0.57	10	39	0	1	0.57	10
7	Pair 11	27	0	0	0.68	11	34	0	1	0.67	11
10	Pair 12	43	1	0	0.20	12	50	0	0	0.23	12
8	Pair 13	43	0	1	0.49	13	42	0	0	0.38	13
9	Pair 14	39	0	1	0.57	14	34	0	0	0.54	14
11	Pair 15	41	0	1	0.53	15	40	0	0	0.42	15
12	Pair 16	36	0	0	0.50	16	44	0	0	0.34	16
13	Pair 17	35	0	0	0.52	17	45	0	0	0.32	17
14	Pair 18	37	0	0	0.48	18	44	0	1	0.47	18
15	Pair 19	39	0	1	0.57	19	31	1	1	0.55	19
16	Pair 20	34	0	0	0.54	20	42	0	0	0.38	20
17	Pair 21	35	0	1	0.65	21	36	0	1	0.64	21

m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Welders — Matched Controls

Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)	—	Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)
38	10	52	0.51	—	40	10	38	0.46

\(\Longrightarrow\) the marginal means the matching above are fairly well balanced, but the individual pairs are often not matched for race or smoking.

Match pair on `Mahalanobis distance within PS calipers`

The caliper is half the standard deviation of the PS, or 0.172/2 = 0.086

z<-1*(gen.tox$Group=="Welders")
aa<-gen.tox$Race
smoker=gen.tox$Smoker
age<-gen.tox$Age
x<-cbind(age,aa,smoker)
# Mahalanobis distances
dmat<-mahal(z,x)

# Impose propensity score calipers
prop<-gen.tox$ps # propensity score
# Mahalanobis distanced penalized for violations of a propensity score caliper.
# This version is used for numerical work.
dmat<-addcaliper(dmat,z,prop,caliper=.5)

# Find the minimum distance match within propensity score calipers.
pm.cal<-optmatch::pairmatch(dmat,data=gen.tox)#, remove.unmatchables = TRUE)

ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal,ipm.cal)
data.pairmatch.cal<-gen.tox.op.caliper[matched(pm.cal),] # only select matched cases

Welders <- data.pairmatch.cal %>%
  filter(Group=="Welders") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps,ipm.cal)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm.cal")]

matched.controls <- data.pairmatch.cal %>%
  filter(Group=="Controls") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)

kable(cbind(Welders,matched.controls), caption="Optimal pair match using the Mahalanobis distance within PS calipers")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 3: Optimal pair match using the Mahalanobis distance within PS calipers
	Pair	Age	Race	Smoker	ps	ipm.cal	Age	Race	Smoker	ps	ipm.cal
1	Pair 1	38	0	0	0.46	1	44	0	0	0.34	1
18	Pair 2	53	0	0	0.19	2	52	0	0	0.20	2
19	Pair 3	38	0	1	0.60	3	34	0	0	0.54	3
2	Pair 4	44	0	0	0.34	4	47	0	0	0.28	4
20	Pair 5	37	0	0	0.48	5	42	0	0	0.38	5
21	Pair 6	38	0	1	0.60	6	31	1	1	0.55	6
3	Pair 7	39	0	1	0.57	7	36	0	1	0.64	7
4	Pair 8	33	1	1	0.51	8	41	1	1	0.35	8
5	Pair 9	35	0	1	0.65	9	35	0	1	0.65	9
6	Pair 10	39	0	1	0.57	10	35	0	0	0.52	10
7	Pair 11	27	0	0	0.68	11	30	0	0	0.63	11
10	Pair 12	43	1	0	0.20	12	48	1	0	0.14	12
8	Pair 13	43	0	1	0.49	13	45	0	0	0.32	13
9	Pair 14	39	0	1	0.57	14	36	0	1	0.64	14
11	Pair 15	41	0	1	0.53	15	44	0	1	0.47	15
12	Pair 16	36	0	0	0.50	16	41	0	0	0.40	16
13	Pair 17	35	0	0	0.52	17	40	0	0	0.42	17
14	Pair 18	37	0	0	0.48	18	42	0	0	0.38	18
15	Pair 19	39	0	1	0.57	19	39	0	1	0.57	19
16	Pair 20	34	0	0	0.54	20	38	0	0	0.46	20
17	Pair 21	35	0	1	0.65	21	34	0	1	0.67	21

m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Welders — Matched Controls

Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)	—	Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)
38	10	52	0.51	—	40	14	38	0.45

Match pair on `rank-based Mahalanobis distance within PS calipers`

The difference here is just substitute the mahal by smahal

z<-1*(gen.tox$Group=="Welders")
aa<-gen.tox$Race
smoker=gen.tox$Smoker
age<-gen.tox$Age
x<-cbind(age,aa,smoker)
# rank-based Mahalanobis distances
dmat<-smahal(z,x)

# Impose propensity score calipers
prop<-gen.tox$ps # propensity score
# Mahalanobis distanced penalized for violations of a propensity score caliper.
# This version is used for numerical work.
dmat<-addcaliper(dmat,z,prop,caliper=.5)

# Find the minimum distance match within propensity score calipers.
pm.cal<-optmatch::pairmatch(dmat,data=gen.tox)#, remove.unmatchables = TRUE)

ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal,ipm.cal)
data.pairmatch.cal<-gen.tox.op.caliper[matched(pm.cal),] # only select matched cases

Welders <- data.pairmatch.cal %>%
  filter(Group=="Welders") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps,ipm.cal)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm.cal")]

matched.controls <- data.pairmatch.cal %>%
  filter(Group=="Controls") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)

kable(cbind(Welders,matched.controls), caption="Optimal pair match using the Mahalanobis distance within PS calipers")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 4: Optimal pair match using the Mahalanobis distance within PS calipers
	Pair	Age	Race	Smoker	ps	ipm.cal	Age	Race	Smoker	ps	ipm.cal
1	Pair 1	38	0	0	0.46	1	44	0	0	0.34	1
18	Pair 2	53	0	0	0.19	2	52	0	0	0.20	2
19	Pair 3	38	0	1	0.60	3	31	1	1	0.55	3
2	Pair 4	44	0	0	0.34	4	47	0	0	0.28	4
20	Pair 5	37	0	0	0.48	5	42	0	0	0.38	5
21	Pair 6	38	0	1	0.60	6	34	0	0	0.54	6
3	Pair 7	39	0	1	0.57	7	35	0	0	0.52	7
4	Pair 8	33	1	1	0.51	8	41	1	1	0.35	8
5	Pair 9	35	0	1	0.65	9	35	0	1	0.65	9
6	Pair 10	39	0	1	0.57	10	36	0	1	0.64	10
7	Pair 11	27	0	0	0.68	11	30	0	0	0.63	11
10	Pair 12	43	1	0	0.20	12	48	1	0	0.14	12
8	Pair 13	43	0	1	0.49	13	45	0	0	0.32	13
9	Pair 14	39	0	1	0.57	14	36	0	1	0.64	14
11	Pair 15	41	0	1	0.53	15	44	0	1	0.47	15
12	Pair 16	36	0	0	0.50	16	41	0	0	0.40	16
13	Pair 17	35	0	0	0.52	17	40	0	0	0.42	17
14	Pair 18	37	0	0	0.48	18	42	0	0	0.38	18
15	Pair 19	39	0	1	0.57	19	39	0	1	0.57	19
16	Pair 20	34	0	0	0.54	20	38	0	0	0.46	20
17	Pair 21	35	0	1	0.65	21	34	0	1	0.67	21

m.age.mps <- round(mean(matched.controls$Age),0)

aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)

s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)

ps.mps <- round(mean(matched.controls$ps),2)

Welders — Matched Controls

Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)	—	Mean Age	AA	Smoker	\(\hat{e}(\mathbf{x})\)
38	10	52	0.51	—	40	14	38	0.45

Optimal Matching with Multiple Controls

In matching with multiple controls, each treated subject is matched to at least one, and possibly more than one, control:

To match in a fixed ratio is to match each treated subject to the same number of controls; for instance, pair matching in a ratio of 1-to-1, or matching in a ratio of 1-to-2.
To match in a variable ratio: to allow the number of controls to vary from one treated subject to another. In the welder data above, allows the possibility of matching with multiple controls in a variable ratio, and in particular to use all of the controls.

The decision to match in fixed or variable ratio that affects both:

the quality of the matching,
the analysis, and
presentation of results.

Matching with a variable number of controls is slightly more complex:

The optimization algorithm decides:
- who is matched to whom,
- how many controls to assign to each treated subject.
The choices of number of controls must be constrained in some reasonable way to avoid trivial results.
- without constraints imposed, if all the distances were positive, the minimum distance matching with variable numbers of controls would always be a pair matching.
- with a simple constraint is to insist that a certain number of controls be used.
  - for example of the welder data, one might insist that all 26 controls be used. Alternatively, one might insist that 23 controls be used, which would permit, but not require, the algorithm to discard the 3 potential controls who are older than all of the welders.
An attractive feature of fixing the total number of controls
- the total distance = a sum of a fixed number of distances.
  - to constraining the total number of controls: a different type of constraint permits a treated subject to have at least one but at most three controls.
- given some set of constraints on the matching, an optimal matching with variable controls minimizes the total distance between treated subjects and controls inside matched sets.

The principal advantage of matching in a fixed ratio:

summary statistics, including those that might be displayed in graphs, may be computed from treated and control groups in the usual way, without using direct adjustment to give unequal weights to observations. This advantage will weigh heavily when potential controls are abundant and the biases that must be removed are not large (!?!).
A smaller issue is statistical efficiency, which in nominal terms, though often not in practical terms, slightly favors matching in fixed ratio over matching with a variable ratio (Ming and Rosenbaum 2000).

Several advantages to matching with a variable ratio Paul R. Rosenbaum (1989):

First, the matched sets will be more closely matched, in the following precise sense. If one finds the minimum distance match with, say, an average of three controls per treated subject and two to four controls for each treated subject, the total distance within matched sets will never be larger, and will typically be quite a bit smaller, than in fixed ratio matching with three controls.4 This is visible in Table: the two welders with low propensity scores were matched to the seven controls with low propensity scores, and trying to allocate these seven more evenly among welders would have produced a larger mismatch on these propensity scores.
Second, matching in fixed ratio requires the number of controls to be an integer multiple of the number of treated subjects, and this restriction may be inconvenient or undesirable for any of a variety of reasons; for instance, for the welder data, the only possible matching in fixed ratio is pair matching.
Third, just as §8.1 discussed definite limits to what can be accomplished with pair matching, there are also definite limits to what can be accomplished by matching with multiple controls, but these limits are better when matching with a variable ratio.

Finding an optimal match with variable controls is equivalent to solving a particular minimum-cost flow problem in a network (Paul R. Rosenbaum 1989).

Optimal Full Matching

In full matching, besides matching with a variable number of controls, i.e. a treated subject could be matched to one or more controls, the reverse situation is permitted: one control may be matched to several treated subjects (P. R. Rosenbaum 1991).
As in matching with variable controls, summary statistics for the control group in full matching must be directly adjusted.
Full matching can be vastly better than pair matching or matching with a variable number of controls. The matched sets would be quite homogeneous, the quite unequal set sizes can lead to some inefficiency.
In one specific sense, an optimal full matching is an optimal design for an observational study (P. R. Rosenbaum 1991). Specifically, define a stratification to be a partitioning of the subjects into groups or strata based on the covariates with the one requirement that each stratum must contain at least one treated subject and at least one control. The quality of a stratification might reasonably be judged by a weighted average of all the within strata distances between treated subjects and controls.
No matter which of these weightings is used, no matter what distance is used, there is always a full matching that minimizes this weighted average distance (P. R. Rosenbaum 1991).

m<-fullmatch(dmat, data=gen.tox, min.controls = 1/7,max.controls = 7)
length(m)

[1] 47

sum(matched(m))

[1] 47

# Housekeeping
im<-as.integer(m)
gen.tox.fm<-cbind(gen.tox,im)
g.t.m<-gen.tox[matched(m),]

Welders <- gen.tox.fm %>%
  filter(Group=="Welders") %>%
  arrange(im) %>%
  dplyr::select(Age, Race, Smoker, ps, im)
#Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Age","Race","Smoker","ps","im")]

fmatched.controls <- gen.tox.fm %>%
  filter(Group=="Controls") %>%
  arrange(im) %>%
  dplyr::select(Age, Race, Smoker, ps, im)

kable(list(Welders,fmatched.controls), caption="Optimal full match using the Mahalanobis distance within PS calipers")  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 5: Optimal full match using the Mahalanobis distance within PS calipers

Age	Race	Smoker	ps	im
38	0	0	0.46	1
53	0	0	0.19	2
38	0	1	0.60	3
44	0	0	0.34	4
38	0	1	0.60	5
39	0	1	0.57	6
39	0	1	0.57	6
39	0	1	0.57	6
41	0	1	0.53	6
39	0	1	0.57	6
33	1	1	0.51	7
35	0	1	0.65	8
27	0	0	0.68	9
43	1	0	0.20	10
43	0	1	0.49	11
36	0	0	0.50	12
35	0	0	0.52	12
37	0	0	0.48	13
37	0	0	0.48	13
34	0	0	0.54	14
35	0	1	0.65	15

Age	Race	Smoker	ps	im
40	0	0	0.42	1
63	0	0	0.09	2
50	0	0	0.23	2
52	0	0	0.20	2
56	0	0	0.15	2
36	0	1	0.64	3
47	0	0	0.28	4
42	0	0	0.38	4
41	0	0	0.40	4
41	1	1	0.35	4
44	0	0	0.34	4
45	0	0	0.32	4
42	0	0	0.38	4
36	0	1	0.64	5
39	0	1	0.57	6
31	1	1	0.55	7
35	0	1	0.65	8
30	0	0	0.63	9
48	1	0	0.14	10
56	1	1	0.13	10
51	1	0	0.12	10
44	0	1	0.47	11
35	0	0	0.52	12
38	0	0	0.46	13
34	0	0	0.54	14
34	0	1	0.67	15

Efficiency

(be continued)

Summary

Matching on one variable, the propensity score, tends to produce treated and control groups that, in aggregate, are balanced with respect to observed covariates;

\(\rightarrow\) Cons: individual pairs that are close on the propensity score may differ widely on specific covariates.

\(\rightarrow\) Solution: to form closer pairs, a distance is used that penalizes large differences on the propensity score, and then finds individual pairs that are as close as possible.

Pairs or matched sets are constructed using an optimization algorithm. Matching with variable controls and full matching combine elements of matching with elements of direct adjustment.

Full matching can often produce closer matches than pair matching.

Basic Tools of Multivariate Matching

Motivation

A Small Example

Load dataset

Descriptive Statistics

Notation

Propensity Score

Conduct `PS`

Distance Matrices

Squared difference in the \(\hat{e}(\textbf{x})\) distance

Justifying by adding caliper on the `PS`

Mahalanobis distance matrix

Sample var-cov matrix

Inverse of sample var-cov matrix

Mahalanobis `dmat` Rubin (1980)

Mahalanobis dmat within `PS` calipers.

Cons of Mahalanobis distance

Solution `rank-based Mahalanobis distance`

Optimal Pair Matching

Match pair on `PS`

Match pair on `Mahalanobis distance within PS calipers`

Match pair on `rank-based Mahalanobis distance within PS calipers`

Optimal Matching with Multiple Controls

Optimal Full Matching

Efficiency

Summary

Further Reading

References

Corrections

Reuse

Citation

Basic Tools of Multivariate Matching

Motivation

A Small Example

Load dataset

Descriptive Statistics

Notation

Propensity Score

Conduct PS

Distance Matrices

Squared difference in the \(\hat{e}(\textbf{x})\) distance

Justifying by adding caliper on the PS

Mahalanobis distance matrix

Sample var-cov matrix

Inverse of sample var-cov matrix

Mahalanobis dmat Rubin (1980)

Mahalanobis dmat within PS calipers.

Cons of Mahalanobis distance

Solution rank-based Mahalanobis distance

Optimal Pair Matching

Match pair on PS

Match pair on Mahalanobis distance within PS calipers

Match pair on rank-based Mahalanobis distance within PS calipers

Optimal Matching with Multiple Controls

Optimal Full Matching

Efficiency

Summary

Further Reading

References

Corrections

Reuse

Citation

Conduct `PS`

Justifying by adding caliper on the `PS`

Mahalanobis `dmat` Rubin (1980)

Mahalanobis dmat within `PS` calipers.

Solution `rank-based Mahalanobis distance`

Match pair on `PS`

Match pair on `Mahalanobis distance within PS calipers`

Match pair on `rank-based Mahalanobis distance within PS calipers`